MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

and 


CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING 
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES 


A.I. Memo No. 1516 July, 1995 

C.B.C.L. Paper No. 115 


The Logical Problem of Language Change 


Partha Niyogi and Robert C. Berwick 

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. 


Abstract 

This paper considers the problem of language change. Linguists must explain not only how languages 
are learned but also how and why they have evolved along certain trajectories and not others. While the 
language learning problem has focused on the behavior of individuals and how they acquire a particular 
grammar from a class of grammars Q, here we consider a population of such learners and investigate the 
emergent, global population characteristics of linguistic communities over several generations. We argue 
that language change follows logically from specific assumptions about grammatical theories and learning 
paradigms. In particular, we are able to transform parameterized theories and memoryless acquisition 
algorithms into grammatical dynamical systems, whose evolution depicts a population’s evolving linguistic 
composition. We investigate the linguistic and computational consequences of this model, showing that 
the formalization allows one to ask questions about diachronic change that one otherwise could not ask, such as
the effect of varying initial conditions on the resulting diachronic trajectories. From a more programmatic 
perspective, we give an example of how the dynamical system model for language change can serve as 
a way to distinguish among alternative grammatical theories, introducing a formal diachronic adequacy 
criterion for linguistic theories. 
1 Introduction

As is well known, languages change over time. Language scientists
have long been occupied with describing phonological, syntactic, and
semantic change, often appealing to the analogy between language
change and evolution. Some even suggest that language itself is a
complex adaptive system (see Hawkins and Gell-Mann, 1989). For
example, Lightfoot (1991, chapter 7, pp. 163-65ff.) talks about
language change in this way: “Some general properties of language
change are shared by other dynamic systems in the natural world...
In population biology and linguistic change there is constant flux.
If one views a language as a totality, as historians often do, one
sees a dynamic system.” Indeed, entire books have been devoted to
the description of language change using the terminology of
population biology: genetic drift, clines, and so forth. 1 However,
these analogies have rarely been pursued beyond casual and
descriptive accounts. 2 In this paper we formalize these intuitions,
to the best of our knowledge for the first time, as a concrete,
computational, dynamical systems model, and investigate the
consequences of this formalization.

In particular, we show that a model of language change emerges as a
logical consequence of language acquisition, a point made by
Lightfoot (1991). We shall see that Lightfoot’s intuition that
languages could behave just as though they were dynamical systems is
essentially correct, as is his proposal for turning language
acquisition models into language change models. We can provide
concrete examples of both “gradual” and “sudden” syntactic changes,
occurring over time periods of many generations to just a single
generation. 3

Many interesting points emerge from the formalization, some
programmatic:

• Learnability is a well-known criterion for the adequacy of
grammatical theories. Our model provides an evolutionary criterion:
by comparing the trajectories of dynamical linguistic systems to
historically observed trajectories, one can determine the adequacy
of linguistic theories or learning algorithms.

• We derive explicit dynamical systems corresponding to parametrized
linguistic theories (e.g., the Head First/Final parameter in
head-driven phrase structure grammars or government-binding
grammars) and memoryless language learning algorithms (e.g.,
gradient ascent in parameter space).

• We illustrate the use of dynamical systems as a research tool by
considering the loss of Verb Second position in Old French as
compared to Modern French. We demonstrate by computer modeling that
one grammatical parameterization in the

1 For a recent example, see Nichols (1992), Linguistic Diversity in
Space and Time.

2 Some notable exceptions are Kroch (1990) and Clark and Roberts
(1993).

3 Lightfoot (1991) refers to these sudden changes acting over a
single generation as “catastrophic,” but in fact this term usually
has a different sense in the dynamical systems literature.


literature does not seem to permit this historical change, while
another does. We can more accurately model the time course of
language change. In particular, in contrast to Kroch (1989) and
others, who mimic population biology models by imposing S-shaped
logistic curves on possible language changes by assumption, we
derive the time course of language change from more basic
assumptions, and show that it need not be S-shaped; rather, an
S-shape can emerge from more fundamental properties of the
underlying dynamical system.

• We examine by simulation and traditional phase-space plots the
form and stability of possible “diachronic envelopes” given varying
alternative language distributions, language acquisition algorithms,
parameterizations, input noise, and sentence distributions. The
results bear on models of language “mixing”; so-called “wave” models
for language change; and other proposals in the diachronic
literature.

• As topics for future research, the dynamical system model provides
a novel possible source for explaining several linguistic changes,
including: (a) the evolution of modern Greek metrical stress
assignment from proto-Indo-European; and (b) Bickerton’s (1990)
“creole hypothesis,” concerning the striking fact that all creoles,
irrespective of linguistic origin, have exactly the same grammar. In
the latter case, the “universality” of creoles could be due to a
parameterization corresponding to a common condensation point of a
dynamical system, a possibility not considered by Bickerton.

2 An Acquisition-Based Model of 
Language Change 

How does the combination of a grammatical theory and learning
algorithm lead to a model of language change? We first note that,
just as with language acquisition, there is a seeming paradox in
language change: it is generally assumed that children acquire their
caretaker (target) grammars without error. However, if this were
always true, at first glance grammatical changes within a population
could seemingly never occur, since generation after generation
children would successfully acquire the grammar of their parents.

Of course, Lightfoot and others have pointed out the 
obvious solution to this paradox: the possibility of slight 
misconvergence to target grammars could, over several 
generations, drive language change, much as speciation 
occurs in the population biology sense: 

As somebody adopts a new parameter setting,
say a new verb-object order, the output of
that person’s grammar often differs from that
of other people’s. This in turn affects the
linguistic environment, which may then be more
likely to trigger the new parameter setting in
younger people. Thus a chain reaction may
be created. (Lightfoot, 1991, p. xxx)

We pursue this point in detail below. Similarly, just as in the
biological case, some of the most commonly observed changes in
languages seem to occur as the result of the effects of surrounding
populations, whose features infiltrate the original language.

We begin our treatment by arguing that the problem of language
acquisition at the individual level leads logically to the problem
of language change at the group or population level. Consider a
population speaking a particular language. 4 This is the target
language: children are exposed to primary linguistic data (PLD) from
this source, typically in the form of sentences uttered by
caretakers (adults). The logical problem of language acquisition is
how children acquire this target language from their primary
linguistic data, that is, to come up with an adequate learning
theory. We take a learning theory to be simply a mapping from
primary linguistic data to the class of grammars, usually effective,
and so an algorithm. For example, in a typical inductive inference
model, given a stream of sentences, an acquisition algorithm would
simply update its grammatical hypothesis with each new sentence
according to some preprogrammed procedure. An important criterion
for learnability (Gold, 1967) is to require that the algorithm
converge to the target as the data goes to infinity (identification
in the limit).

Now suppose that we fix an adequate grammatical theory and an
adequate acquisition algorithm. There are then essentially two means
by which the linguistic composition of the population could change
over time. First, if the primary linguistic data presented to the
child is altered (due to any number of causes, perhaps the presence
of foreign speakers, contact with another population, disfluencies,
and the like), the sentences presented to the learner (child) are no
longer consistent with a single target grammar. In the face of this
input, the learning algorithm might no longer converge to the target
grammar. Indeed, it might converge to some other grammar ($g_2$); or
it might converge to $g_2$ with some probability, $g_3$ with some
other probability, and so forth. In either case, children attempting
to solve the acquisition problem using the same learning algorithm
could internalize grammars different from the parental (target)
grammar. In this way, in one generation the linguistic composition
of the population can change. 5

Second, even if the PLD comes from a single target grammar, the
actual data presented to the learner is truncated, or finite. After
a finite sample sequence, children may, with non-zero probability,
hypothesize a grammar different from that of their parents. This can
again lead to a differing linguistic composition in succeeding
generations.

In short, the diachronic model is this: Individual children attempt
to attain their caretaker target grammar.


4 In our analysis this implies that all the adult members of this
population have internalized the same grammar (corresponding to the
language they speak).

5 Sociological factors affecting language change affect language
acquisition in exactly the same way, yet are abstracted away from
the formalization of the logical problem of language acquisition. In
this same sense, we similarly abstract away such causes here.


After a finite number of examples, some are successful, but others
may misconverge. The next generation will therefore no longer be
linguistically homogeneous. The third generation of children will
hear sentences produced by the second (a different distribution) and
they, in turn, will attain a different set of grammars. Over
successive generations, the linguistic composition evolves as a
dynamical system.

On this view, language change is a logical consequence 
of specific assumptions about: 

1. the grammar hypothesis space: a particular parametrization, in a
parametric theory;

2. the language acquisition device: the learning algorithm the child
uses to develop hypotheses on the basis of data;

3. the primary linguistic data: the sentences presented to the
children of any one generation.

If we specify (1) through (3) for a particular generation, we
should, in principle, be able to compute the linguistic composition
for the next generation. In this manner, we can compute the evolving
linguistic composition of the population from generation to
generation; we arrive at a dynamical system. We now proceed to make
this calculation precise. We first review a standard language
acquisition framework, and then show how to derive a dynamical
system from it.

2.1 The Language Acquisition Framework 

Let us state our assumptions about grammatical theories, learning
algorithms, and sentence distributions.

1. Denote by $Q$ a family of possible (target) grammars. Each
grammar $g \in Q$ defines a language $L(g) \subseteq \Sigma^*$ over
some alphabet $\Sigma$ in the usual way.

2. Denote by $P$ a distribution on $\Sigma^*$ according to which
sentences are drawn and presented to the learner. Note that if there
is a well-defined target, $g_t$, and only positive examples from
this target are presented to the learner, then $P$ will have all its
measure on $L(g_t)$, and zero measure on sentences outside $L(g_t)$.
Suppose $n$ examples are drawn in this fashion; one can then let
$\mathcal{D}_n = (\Sigma^*)^n$ be the set of all $n$-example data
sets the learner might be presented with. Thus, if the adult
population is linguistically homogeneous (with grammar $g_1$), then
$P = P_1$. If the adult population speaks 50 percent $L(g_1)$ and 50
percent $L(g_2)$, then $P = \frac{1}{2} P_1 + \frac{1}{2} P_2$.

3. Denote by $A$ the acquisition algorithm that children use to
hypothesize a grammar on the basis of input data. $A$ can be
regarded as a mapping from $\mathcal{D}_n$ to $Q$. Thus, acting upon
a particular presentation sequence $d_n \in \mathcal{D}_n$, the
learner posits a hypothesis $A(d_n) = h_n \in Q$. Allowing for the
possibility of randomization, the learner could, in general, posit
$h_i \in Q$ with probability $p_i$ for such a presentation sequence
$d_n$. The standard (stochastic version) learnability criterion
(Gold, 1967) can then be stated as follows:

For every target grammar $g_t \in Q$, with positive-only examples
presented according to $P$ as above, the learner must converge to
the target with probability 1, i.e.,

$$\mathrm{Prob}[A(d_n) = g_t] \longrightarrow_{n \to \infty} 1$$

For an analysis of learnability issues for memoryless algorithms in
finite parameter spaces, consult Niyogi (1995).

2.2 From Language Learning to Population
Dynamics

The framework for language learning has learners attempting to infer
grammars on the basis of linguistic data. At any point in time, $n$
(i.e., after hearing $n$ examples), the learner has a current
hypothesis, $h$, with probability $p_n(h)$. What happens when there
is a population of learners? Since an arbitrary learner has a
probability $p_n(h)$ of developing hypothesis $h$ (for every
$h \in Q$), it follows that a fraction $p_n(h)$ of the population of
learners internalize the grammar $h$ after $n$ examples. We
therefore have a current state of the population after $n$ examples.
This state of the population might well be different from the state
of the parent population. Assume for now that after $n$ examples,
maturation occurs, i.e., after $n$ examples the learner retains the
grammatical hypothesis for the rest of its life. Then one would
arrive at the state of the mature population for the next
generation. 6 This new generation now produces sentences for the
following generation of learners according to the distribution of
grammars in its population. Then the process repeats itself and the
linguistic composition of the population evolves from generation to
generation.

We can now define a discrete time dynamical system 
by providing its two necessary components: 

A State Space: a set of system states, $S$. Here the state space is
the space of possible linguistic compositions of the population.
Each state is described by a distribution $P_{pop}$ on $Q$
describing the language spoken by the population. 7 At any given
point in time, $t$, the system is in exactly one state $s \in S$.

An Update Rule: how the system states change from one time step to
the next. Typically, this involves specifying a function, $f$, that
maps $s_t \in S$ to $s_{t+1}$. 8

For example, a typical linear dynamical system might consist of
state variables $x$ (where $x$ is a $k$-dimensional state vector)
and a system of differential equations $x' = Ax$ ($A$ is a matrix
operator) which characterize the evolution of the states with time.
RC circuits are a simple example of linear dynamical systems. The
state (current) evolves as the capacitor discharges through the
resistor. Population growth models (for example, using logistic
equations) provide other examples.
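As a minimal sketch of the two components just named (a state and an update rule), the following iterates a discrete-time logistic map. The function names `iterate` and `logistic`, the growth rate `r = 2.5`, and the starting value are illustrative assumptions, not quantities from this paper.

```python
def iterate(update, state, steps):
    """Apply the update rule `steps` times and return the whole trajectory."""
    trajectory = [state]
    for _ in range(steps):
        state = update(state)
        trajectory.append(state)
    return trajectory


def logistic(x, r=2.5):
    """Discrete logistic update x_{t+1} = r * x_t * (1 - x_t); r is an assumed value."""
    return r * x * (1.0 - x)


# The state is a single number in [0, 1]; the update rule is `logistic`.
trajectory = iterate(logistic, 0.1, 50)
```

For this choice of `r` the trajectory settles at the fixed point 1 - 1/r = 0.6; other values of `r` give qualitatively different behavior, which is the kind of question a phase-space analysis addresses.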

6 Maturation seems to be a reasonable hypothesis in this context.
After all, it seems even more unreasonable to imagine that learners
are forever wandering around in hypothesis space. There is evidence
from developmental psychology to suggest that this is the case, and
that after a certain point children mature and retain their current
grammatical hypotheses forever.

7 As usual, one needs to be able to define a $\sigma$-algebra on the
space of grammars, and so on. This is unproblematic for the cases
considered in this paper because the set of grammars is finite.

8 In general, this mapping could be fairly complicated. For example,
it could depend on previous states, future states, and so forth; for
reasons of space we do not consider all possibilities here. For
reference, see Strogatz, 1993.



Figure 1: A simple illustration of the state space for the
3-parameter syntactic case. There are 8 grammars. A probability
distribution on these 8 grammars, as shown above, can be interpreted
as the linguistic composition of the population. Thus, a fraction
$P_1$ of the population have internalized grammar $g_1$, and so on.


As a linguistic example, consider the three-parameter syntactic
space described in Gibson and Wexler (1994). This defines 8 possible
“natural” grammars. Thus $Q$ has 8 elements. We can picture a
distribution on this space as shown in fig. 1. In this particular
case, the state space is

$$S = \{ P \in \mathbb{R}^8 \mid \sum_{i=1}^{8} P_i = 1 \}$$

Here we interpret the state as the linguistic composition of the
population. 9 For example, a distribution that puts all its weight
on grammar $g_1$ and 0 everywhere else indicates a homogeneous
population that speaks a language corresponding to grammar $g_1$.
Similarly, a distribution that puts a probability mass of 1/2 on
$g_1$ and 1/2 on $g_2$ denotes a (nonhomogeneous) population with
half its speakers speaking a language corresponding to $g_1$ and
half speaking a language corresponding to $g_2$.

To see in detail how the update rule may be computed, consider the
acquisition algorithm, $A$. For example, given the state at time
$t$, ($P_{pop,t}$), the distribution of speakers in the parental
population, one can obtain the distribution with which sentences
from $\Sigma^*$ will be presented to the learner. To do this,
imagine that the $i$th linguistic group in the population, speaking
language $L_i$, produces sentences with distribution $P_i$. Then for
any $\omega \in \Sigma^*$, the probability with which $\omega$ is
presented to the learner is given by

$$P(\omega) = \sum_i P_i(\omega) \, P_{pop,t}(i)$$
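The mixture formula can be sketched in a few lines. The two grammars, their sentence sets, and the function name `mixture` are hypothetical toy data chosen for illustration; only the weighted sum itself comes from the text.

```python
from collections import defaultdict


def mixture(population, sentence_dists):
    """Return P(w) = sum_i P_i(w) * P_pop(i) as a dict: sentence -> probability."""
    p = defaultdict(float)
    for grammar, weight in population.items():
        for sentence, prob in sentence_dists[grammar].items():
            p[sentence] += weight * prob
    return dict(p)


# Hypothetical toy data: each group is uniform over its own two sentences.
sentence_dists = {
    "g1": {"S V O": 0.5, "S V": 0.5},
    "g2": {"S O V": 0.5, "S V": 0.5},
}
population = {"g1": 0.5, "g2": 0.5}  # half the adults speak each language

P = mixture(population, sentence_dists)
```

In this toy case the sentence shared by both languages receives half of the probability mass, so half of the learner's input does not distinguish the two grammars.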

This fixes the distribution with which sentences are presented to
the learner. The logical problem of language acquisition also
assumes some success criterion for attaining the mature target
grammar. For our purposes, we take this as being one of two broad
possibilities: either (1) the usual Gold scenario of identification
in the limit, what we shall call the limiting sample case; or (2)

9 Note that we do not allow for the possibility of a single learner
having more than one hypothesis at a time; an extension to this
case, in which individuals would more closely resemble the
“ensembles” of particles in a thermodynamic system, is left for
future research.



identification in a fixed, finite time, what we shall call 
the finite sample case. 10 

Consider case (2) first. Here, one draws $n$ example sentences
according to distribution $P$, and the acquisition algorithm
develops hypotheses ($A(d_n) \in Q$). One can, in principle, compute
the probability with which the learner will posit hypothesis $h_i$
after $n$ examples:

Finite Sample: $\mathrm{Prob}[A(d_n) = h_i] = p_n(h_i)$    (1)

The finite sample situation is always well defined: the probability
$p_n$ always exists. 11

Now turn to case (1), the limiting case. Here learnability requires
$p_n(g_t)$ to go to 1, for the unique target grammar, $g_t$, if such
a grammar exists. However, in general there need not be a unique
target grammar since the linguistic population can be
nonhomogeneous. Even so, the following limiting behavior might still
exist:

Limiting Sample: $\lim_{n \to \infty} \mathrm{Prob}[A(d_n) = h_i] = p(h_i)$    (2)

Turning from the individual child to the population, since the
individual child internalizes grammar $h_i \in Q$ with probability
$p_n(h_i)$ in the “finite sample” case or with probability $p(h_i)$
“in the limit”, in a population of such individuals one would
therefore expect a proportion $p_n(h_i)$ or $p(h_i)$ respectively to
have internalized grammar $h_i$. In other words, the linguistic
composition of the next generation is given by
$P_{pop,t+1}(h_i) = p_n(h_i)$ for the finite sample case and by
$P_{pop,t+1}(h_i) = p(h_i)$ in the limiting sample case. In this
fashion, we obtain the update rule $P_{pop,t} \to P_{pop,t+1}$.


Remarks. 1. For a Gold-learnable family of languages and a limiting
sample assumption, homogeneous populations are always stable. This
is simply because each child, and therefore the entire population,
always eventually converges to a single target grammar, generation
after generation.

2. However, the finite sample case is different from the limiting
sample case. Suppose we have solved the maturation problem, that is,
we know roughly the time, or number of examples $N$, the learner
takes to develop its mature (adult) hypothesis. In that case
$p_N(h)$ is the probability that a child internalizes the grammar
$h$, and $p_N(h)$ is the proportion of speakers of $L_h$ in the next
generation. Note that under this finite sample analysis, even for a
homogeneous population with all adults
10 Of course, a variety of other success criteria, e.g., convergence
within some epsilon, or polynomial in the size of the target
grammar, are possible; each leads to a potentially different
language change model. We do not pursue these alternatives here.

11 This is easy to see for deterministic algorithms, $A_{det}$. Such
an algorithm would have a precise behavior for every data set of $n$
examples drawn. In our case, the examples are drawn in i.i.d.
fashion according to a distribution $P$ on $\Sigma^*$. It is clear
that $p_n(h_i) = P[\{ d_n \mid A_{det}(d_n) = h_i \}]$. For
randomized algorithms, the case is trickier and more tedious, but
the probability still exists because all the finite choice paths
over all sequences of length $n$ are enumerable. Previous work
(Niyogi and Berwick, 1993, 1994a, 1994b) shows how to compute $p_n$
for randomized memoryless algorithms.


speaking a particular language (corresponding to grammar $g$, say),
$p_N(g)$ will not be 1; that is, there will be a small percentage of
learners who have misconverged. This percentage could blow up over
several generations, and we therefore have potentially unstable
languages.

3. The formulation is very general. Any ($Q$, $A$, $\{P_i\}$) triple
yields a dynamical system. 12 In short:

$$(Q, A, \{P_i\}) \longrightarrow \mathcal{D} \text{ (dynamical system)}$$

4. The formulation also does not assume any particular linguistic
theory, learning algorithm, or distribution with which sentences are
drawn. Of course, we have implicitly assumed a learning model, i.e.,
positive examples are drawn in i.i.d. fashion and presented to the
learner. Our dynamical systems formalization follows as a logical
consequence of this learning framework. One can conceivably imagine
other learning frameworks; these would potentially give rise to
other kinds of dynamical systems, but we do not formalize them here.

This completes the abstract formulation of the dynamical system
model. Next, we choose specific linguistic theories and learning
paradigms to model particular kinds of language changes, with the
goal of answering the following questions:

• Can we really compute all the relevant quantities to specify the
dynamical system?

• Can we evaluate the behavior (phase-space characteristics) of the
resulting dynamical system?

• Does the dynamical system model (the formalization) shed light on
diachronic models and linguistic theories generally?

In the remainder of this paper, we give some concrete answers to
these questions within the principles and parameters theory of
modern linguistics.

3 Language Change in Parametric 
Systems 

In previous works (Niyogi and Berwick, 1993, 1994a, 1994b; Niyogi,
1995), we investigated the problem of learnability within parametric
systems. In particular, we showed that the behavior of any
memoryless algorithm can be modeled as a Markov chain. This analysis
allows us to solve equations 1 and 2, and thus obtain the update
equations of the associated dynamical system. Let us now show how to
derive such models in detail. We first provide the particular
($Q$, $A$, $\{P_i\}$) triple, and then give the update rule.

The learning system triple. 

1. $Q$: Assume there are $n$ parameters; this leads to a space $Q$
with $2^n$ different grammars.

2. $A$: Let us imagine that the child learner follows some
memoryless (incremental) algorithm to set parameters. For the most
part, we will assume that
12 Note that this probability could evolve with generations as well.
That will complete all the logical possibilities. However, for
simplicity, we assume that this does not happen.



the algorithm is the “triggering learning algorithm” or TLA (the
single-step, gradient-ascent algorithm of Gibson and Wexler, 1994)
or one of the variants discussed in Niyogi and Berwick (1993).

3. $\{P_i\}$: Let speakers of the $i$th language, $L_i$, in the
population produce sentences according to the distribution $P_i$.
For the most part we will assume in our simulations that this
distribution is uniform on degree-0 (unembedded) sentences, exactly
as in the learnability analysis of Gibson and Wexler (1994) or
Niyogi and Berwick (1993).
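To make the triple concrete, here is a hedged sketch of how a Markov transition matrix could be built for a TLA-style learner. It assumes the usual description of the algorithm: when the current grammar cannot analyze the input sentence, one parameter is chosen uniformly at random and flipped, and the new grammar is adopted only if it analyzes the sentence. The 2-parameter space, the sentence labels, and the name `tla_transition_matrix` are hypothetical; they are not the Gibson and Wexler system.

```python
from itertools import product


def tla_transition_matrix(languages, P, n_params):
    """Build T[g][g'] for a single-step, greedy memoryless learner.

    languages: dict mapping a parameter tuple to the set of sentences it analyzes
    P:         dict mapping a sentence to its probability of being presented
    """
    grammars = list(product((0, 1), repeat=n_params))
    index = {g: k for k, g in enumerate(grammars)}
    size = len(grammars)
    T = [[0.0] * size for _ in range(size)]
    for g in grammars:
        row = T[index[g]]
        for sentence, prob in P.items():
            if sentence in languages[g]:
                continue  # sentence is analyzable: the hypothesis is retained
            for k in range(n_params):
                # Flip exactly one parameter (single-step constraint).
                neighbor = g[:k] + (1 - g[k],) + g[k + 1:]
                if sentence in languages[neighbor]:  # greediness constraint
                    row[index[neighbor]] += prob / n_params
        row[index[g]] = 1.0 - sum(row)  # remaining mass: stay put
    return grammars, T


# Hypothetical 2-parameter space with made-up sentence labels.
languages = {
    (0, 0): {"a", "b"},
    (0, 1): {"a", "c"},
    (1, 0): {"b", "d"},
    (1, 1): {"c", "d"},
}
P = {"c": 0.5, "d": 0.5}  # uniform on the language of the target (1, 1)
grammars, T = tla_transition_matrix(languages, P, 2)
```

With a single target the target state is absorbing (its row is all zeros except a 1 on the diagonal), while every other row leaks probability toward it.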

The update rule. We can now compute the update rule associated with
this triple. Suppose the state of the parental population is
$P_{pop,n}$ on $Q$. Then one can obtain the distribution $P$ on the
sentences of $\Sigma^*$ according to which sentences will be
presented to the learner. Once such a distribution is obtained, then
given the Markov equivalence established earlier, we can compute the
transition matrix $T$ according to which the learner updates its
hypotheses with each new sentence. From $T$ one can finally compute
the following quantities, one for the “finite sample” case and one
for the “limiting sample” case:

Prob[Learner’s hypothesis = $h_i \in Q$ after $m$ examples]

$$= \left\{ \frac{1}{2^n} (1, \ldots, 1)' \, T^m \right\}[i]$$

Similarly, making use of the limiting distributions of Markov chains
(Resnick, 1992) one can obtain the following (where ONE is a
$2^n \times 2^n$ matrix with all ones).

Prob[Learner’s hypothesis = $h_i$ “in the limit”]

$$= \left\{ (1, \ldots, 1)' \, (I - T + \mathrm{ONE})^{-1} \right\}[i]$$
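A short check of why the limiting expression yields a stationary distribution may be helpful. The symbol $\mu$ for the stationary row vector is our notation, introduced only for this derivation, and the last step assumes that the inverse exists (for example, when the chain has a unique stationary distribution).

```latex
\begin{align*}
\mu T &= \mu, \qquad \mu\,(1,\ldots,1)^{\top} = 1
  && \text{(stationarity and normalization)} \\
\mu\,\mathrm{ONE} &= (1,\ldots,1)
  && \text{(each entry equals } \textstyle\sum_j \mu_j = 1\text{)} \\
\mu\,(I - T + \mathrm{ONE}) &= \mu - \mu + (1,\ldots,1) = (1,\ldots,1) \\
\mu &= (1,\ldots,1)\,(I - T + \mathrm{ONE})^{-1}
  && \text{(when the inverse exists)}
\end{align*}
```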

These expressions allow us to compute the linguistic composition of
the population from one generation to the next according to our
analysis of the previous section.
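Both quantities can be computed directly from a transition matrix. The sketch below uses only the standard library; the 2-state matrix is a made-up example, the function names are ours, and the linear solve assumes that I - T + ONE is invertible.

```python
def distribution_after(T, m):
    """Hypothesis distribution after m examples, from a uniform initial hypothesis."""
    size = len(T)
    dist = [1.0 / size] * size
    for _ in range(m):
        # One step of the chain: row vector times T.
        dist = [sum(dist[i] * T[i][j] for i in range(size)) for j in range(size)]
    return dist


def limiting_distribution(T):
    """Solve mu (I - T + ONE) = (1, ..., 1) by Gauss-Jordan elimination.

    Assumes the matrix is invertible; a singular system raises ZeroDivisionError.
    """
    size = len(T)
    # Transposed system with the all-ones right-hand side appended as a column.
    A = [[(1.0 if i == j else 0.0) - T[j][i] + 1.0 for j in range(size)] + [1.0]
         for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(size):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][size] for i in range(size)]


# Hypothetical 2-state learner chain with no absorbing state.
T = [[0.9, 0.1],
     [0.2, 0.8]]
finite = distribution_after(T, 128)
limit = limiting_distribution(T)
```

For this matrix the limiting distribution is (2/3, 1/3), and after 128 examples the finite-sample distribution is numerically indistinguishable from it; with far fewer examples the two differ, which is the gap the finite-sample analysis exploits.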

Remark. The limiting distribution case is more complex than the
finite sample case and requires some careful explanation. There are
two possibilities. If there is just a single target grammar, then,
by definition, the learners all identify the target correctly in the
limit, and there is no further change in the linguistic composition
from generation to generation. This case is essentially
uninteresting. If there are two or more target grammars, then
recalling our analysis of learnability (Niyogi and Berwick, 1994),
there can be no absorbing states in the Markov chain corresponding
to the parametric grammar family. In this situation, a single
learner will oscillate between some set of states in the limit. In
this sense, learners will not converge to any single, correct target
grammar. However, there is a sense in which we can characterize
limiting behavior for learners: although a given learner will visit
each of these states infinitely often in the limit, it will visit
some more often than others. The exact percentage of time the
learner spends in a particular state is given by equation 3 above.
Therefore, since we know the fraction of the time the learner spends
in each grammatical state in the limit, we assume that this is the
probability with which it internalizes the grammar corresponding to
that state in the Markov chain.

Summarizing, we provide the basic computational 
framework for modeling language change: 

1. Let $\pi_1$ be the initial population mix, i.e., the percentage
of different language speakers in the community. Assuming that the
$i$th group of speakers produces sentences with probability $P_i$,
we can obtain the probability $P$ with which sentences in $\Sigma^*$
occur for the next generation of learners.

2. From $P$ we can obtain the transition matrix $T$ for the Markov
learning model and the limiting distribution of the linguistic
composition $\pi_2$ for the next generation.

3. The second generation now has a population mix of $\pi_2$. We
repeat step 1 and obtain $\pi_3$. Continuing in this fashion, in
general we can obtain $\pi_{i+1}$ from $\pi_i$.
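The three steps above can be sketched end to end for a deliberately tiny case. Everything specific here is an assumption made for illustration: a single binary parameter (two grammars), learners that start from a uniform random hypothesis, the finite-sample variant with `m` examples rather than the limiting distribution, and the two numbers `a` and `b`, which stand for the share of each language's sentences that the other grammar cannot analyze.

```python
def next_generation(mix, a, b, m):
    """One generation: population mix -> P -> T -> hypothesis distribution after m examples.

    a: share of L2's sentences lying outside L1 (unambiguous evidence for g2)
    b: share of L1's sentences lying outside L2 (unambiguous evidence for g1)
    """
    x1, x2 = mix
    t12 = x2 * a  # a g1-learner hears an L2-only sentence and switches to g2
    t21 = x1 * b  # a g2-learner hears an L1-only sentence and switches to g1
    T = [[1.0 - t12, t12], [t21, 1.0 - t21]]
    dist = [0.5, 0.5]  # assumption: uniform initial hypothesis
    for _ in range(m):
        dist = [dist[0] * T[0][0] + dist[1] * T[1][0],
                dist[0] * T[0][1] + dist[1] * T[1][1]]
    return dist


def evolve(mix, a, b, m, generations):
    """Iterate the update rule and return the population mix at every generation."""
    history = [mix]
    for _ in range(generations):
        mix = next_generation(mix, a, b, m)
        history.append(mix)
    return history


# Homogeneous g2 population; g1 has the larger share of unambiguous sentences.
history = evolve([0.0, 1.0], a=0.3, b=0.6, m=16, generations=30)
# Homogeneous g1 population under the same assumptions.
stable = evolve([1.0, 0.0], a=0.3, b=0.6, m=16, generations=30)
```

Under these assumed numbers the homogeneous g2 population is unstable: a fraction of about 0.17 percent misconverges in the first generation, and that minority grows until g1 dominates, whereas the homogeneous g1 population stays essentially unchanged. The toy model says nothing about which real languages behave this way; it only shows how a small misconvergence rate can be amplified by the update rule.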

We next turn to specific applications of this model. We begin with a
simple 3-parameter system as our first example, considering
variations on the learning algorithm, sentence distributions, and
sample size available for learning. We then consider a different,
5-parameter system already presented in the literature (Clark and
Roberts, 1993) as one intended to partially characterize the change
from Old French to Modern French.

4 Example 1: A Three Parameter 
System 

The previous section developed the necessary mathematical and
computational tools to completely specify the dynamical systems
corresponding to memoryless algorithms operating on finite parameter
spaces. In this example we investigate the behavior of these
dynamical systems. Recall that every choice of ($Q$, $A$, $\{P_i\}$)
gives rise to a unique dynamical system. We start by making specific
choices for these three elements:

1. $Q$: This is a 3-parameter syntactic subsystem described in
Gibson and Wexler (1994). Thus $Q$ has exactly 8 grammars,
generating languages from $L_1$ through $L_8$, as shown in the
appendix of this paper (taken from Gibson and Wexler, 1994).

2. $A$: The memoryless algorithms we consider are the TLA, and
variants obtained by dropping either or both of the single-valued
and greediness constraints.

3. $\{P_i\}$: For the most part, we assume sentences are produced
according to a uniform distribution on the degree-0 sentences of the
relevant language, i.e., $P_i$ is uniform on (degree-0 sentences of)
$L_i$.

Ideally of course, a complete investigation of diachronic
possibilities would involve varying $Q$, $A$, and $\{P_i\}$ and
characterizing the resulting dynamical systems by their phase space
plots. Rather than explore this entire space, we first consider only
systems evolving from homogeneous initial populations, under four
basic variants of the learning algorithm $A$. This will give us an



Initial Language    Change to Language?
(-V2) 1             2 (0.85), 6 (0.1)
(+V2) 2             2 (0.98); stable
(-V2) 3             6 (0.48), 8 (0.38)
(+V2) 4             4 (0.86); stable
(-V2) 5             2 (0.97)
(+V2) 6             6 (0.92); stable
(-V2) 7             2 (0.54), 4 (0.35)
(+V2) 8             8 (0.97); stable


Table 1: Language change driven by misconvergence from a homogeneous
initial linguistic population. A finite-sample analysis was
conducted allowing each child learner 128 examples to internalize
its grammar. After 30 generations, initial populations drifted (or
not, as shown in the table) to different final linguistic
compositions.


initial grasp of how linguistic populations can change. Indeed,
linguistic change has been studied before; even the dynamical system
metaphor itself has been invoked. Our computational paradigm lets us
say much more than these previous descriptions: (1) we can say
precisely what the rates of change will be; (2) we can determine
what diachronic population curve changes will look like, without
stipulating in advance that they must be S-shaped (sigmoid) or not,
and without curve fitting to a pre-defined functional form.

4.1 Homogeneous Initial Populations 

First we consider the case of a homogeneous population, with no noise or confounding factors like foreign target languages. How stable are the languages in the 3-parameter system in this case? To determine this, we begin with a finite-sample analysis with n = 128 example sentences (recall from the analysis of Niyogi and Berwick (1993, 1994a, 1994b) that learners converge to target languages in the 3-parameter system with high probability after hearing this many sentences). Some small proportion of the children misconverge; the goal is to see whether this small proportion can drive language change and, if so, in what direction. To give the reader some idea of the possible outcomes, let us consider the four possible variations in the learning algorithm (±Single-step, ±Greedy), holding fixed the sentence distributions and learning sample.

4.1.1 Variation 1: A = TLA (+Single Step, +Greedy); $P_i$ = Uniform; Finite Sample = 128

Suppose the learning algorithm is the triggering learning algorithm (TLA). Table 1 shows the language mix after 30 generations. Languages are numbered from 1 to 8. Recall that +V2 refers to a language that has the verb-second property, and -V2 to one that does not.


Observations. Some striking patterns regarding the 
resulting population mixes can be noted. 


1. First, all the +V2 languages are relatively stable, i.e., the linguistic composition did not vary significantly over 30 generations. This means that every succeeding generation acquired the target parameter settings and no parameter drifts were observed over time.

2. In contrast, populations speaking -V2 languages all drift to +V2 languages. Thus a population speaking $L_1$ winds up speaking mostly $L_2$ (85%). A population speaking language $L_7$ gradually shifts to a population with 54 percent speaking $L_2$ and 35 percent speaking $L_4$ (with a smattering of other speakers) and apparently remains basically stable in this mix thereafter. Note that the relative stability of +V2 languages and the tendency of -V2 languages to drift to +V2 is exactly contrary to evidence in the linguistic literature. Lightfoot (1991), for example, claims that the tendency to lose V2 dominates the reverse tendency in the world’s languages. Certainly, both English and French lost the V2 parameter setting, an empirically observed phenomenon that needs to be explained. Immediately then, we see that our dynamical system does not evolve in the expected manner. The reason could lie in any of the assumptions behind the model: the parameter space, the learning algorithm, the initial conditions, or the distributional assumptions about sentences presented to learners. Exactly which is in error remains to be seen, but nonetheless our example shows concretely how assumptions about a grammatical theory and learning theory can make evolutionary, diachronic predictions; in this case, incorrect predictions that falsify the assumptions.

3. The rates at which the linguistic composition changes vary significantly from language to language. Consider for example the change of $L_1$ to $L_2$. Figure 2 below shows the gradual decrease in speakers of $L_1$ over successive generations along with the increase in $L_2$ speakers. We see that over the first 6 or 7 generations very little change occurs, but over the next 6 or 7 generations the population changes at a much faster rate. Note that in this particular case the two languages differ only in the V2 parameter, so the curves essentially plot the gain of V2. In contrast, consider figure 3, which shows the decrease of $L_5$ speakers and the shift to $L_2$. Here we note a sudden change: over a space of just 4 generations, the population shifts completely. Analysis of the time course of language change has been given some attention in linguistic analyses of diachronic syntax change, and we return to this issue below.

4. We see that in many cases a homogeneous population splits up into different linguistic groups, and seems to remain stable in that mix. In other words, certain combinations of language speakers seem to asymptote towards equilibrium (at least through 30 generations). For example, a population of $L_7$ speakers shifts over 5-6 generations to one with 54 percent speaking $L_2$ and 35 percent speaking $L_4$
and remains that way with no shifts in the distri- 

















Initial Language    Change to Language?
-V2 1               2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 2               2 (0.42), 4 (0.19), 6 (0.17), 8 (0.12)
-V2 3               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 4               2 (0.41), 4 (0.19), 6 (0.18), 8 (0.13)
-V2 5               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 6               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
-V2 7               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)
+V2 8               2 (0.40), 4 (0.19), 6 (0.18), 8 (0.13)

Table 2: Language change driven by misconvergence. A finite-sample analysis was conducted allowing each child learner (following the TLA with single-value dropped) 128 examples to internalize its grammar. Initial populations were linguistically homogeneous, and they drifted to different linguistic compositions. The major language groups after 30 generations have been listed in this table. Note how all initially homogeneous populations tend to the same composition.



Figure 4: Time evolution of grammars using a greedy learning algorithm with no single value constraint in place. (Horizontal axis: generations.)


there is a tendency to gain V2 rather than lose V2, 
contrary to the empirical facts. 

As an example, Fig. 4 shows the changing percentage of the population speaking the different languages starting off from a homogeneous population speaking $L_5$. As before, learners who have not converged to the target in 128 examples are the driving force for change here. Note again the time evolution of the grammars. For about 5 generations there is only a slight decrease in the percentage of speakers of $L_5$. Then the linguistic patterns switch rapidly over the next 7 generations to a relatively stable mix.

4.1.3 Variations 3 & 4: -Greedy, ±Single Value constraint; $P_i$ = Uniform; Finite Sample = 128

Having dropped the single value constraint, we consider the next obvious variation in the learning algorithm: dropping greediness while varying the single value constraint. Again, our goal is to see whether this makes any difference in the resulting dynamical system. This gives rise to two different learning algorithms: (1) allow the learning algorithm to pick any new grammar at most one parameter value away from its current hypothesis (retaining the single-value constraint, but without greediness; that is, the new grammar does not have to be able to parse the current input sentence); (2) allow the learning algorithm to pick any new grammar at each step (no matter how far away from its current hypothesis).
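The four learner variants discussed in this section differ only in two switches, which can be sketched as a single update step. This is a minimal illustration under stated assumptions, not the authors' implementation: grammars are represented as tuples of binary parameter values, `parses` is a hypothetical oracle supplied by the caller, and the "single value" candidate is taken to be exactly one parameter flip chosen uniformly at random.

```python
import random


def flip(grammar, index):
    """Return a copy of a binary parameter tuple with one value flipped."""
    return grammar[:index] + (1 - grammar[index],) + grammar[index + 1:]


def learner_step(grammar, sentence, parses, single_value=True, greedy=True, rng=random):
    """One memoryless update on a single input sentence.

    `parses(grammar, sentence)` is a hypothetical oracle (an assumption of
    this sketch); it must return True when the grammar analyzes the sentence.
    """
    if parses(grammar, sentence):
        return grammar  # the current hypothesis analyzes the input: no change
    if single_value:
        # Candidate differs from the current hypothesis in exactly one parameter.
        candidate = flip(grammar, rng.randrange(len(grammar)))
    else:
        # Candidate is any grammar in the space, however far away.
        candidate = tuple(rng.randint(0, 1) for _ in grammar)
    if greedy and not parses(candidate, sentence):
        return grammar  # a greedy learner keeps its old hypothesis
    return candidate


def toy_parses(grammar, sentence):
    """Toy oracle (assumption): a 'sentence' is the set of grammars that parse it."""
    return grammar in sentence
```

With `single_value=True, greedy=True` the step corresponds to the TLA of Variation 1; setting either flag to `False` gives the relaxed learners of Variations 2 through 4.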

In both cases, the population mix after 30 generations is the same irrespective of the initial language of the homogeneous population. These results are shown in Table 3.

Observations: 

1. Both algorithms yield dynamical systems that arrive at the same population mix after 30 generations. The path by which they arrive at this mix is, however, not the same (see figure 5).


Initial Language              Change to Language?
Any Language (Homogeneous)    1 (0.11), 2 (0.16), 3 (0.10), 4 (0.14),
                              5 (0.12), 6 (0.14), 7 (0.10), 8 (0.13)

Table 3: Language change driven by misconvergence, using two different acquisition algorithms that do not obey a local gradient-ascent rule (a greediness constraint). A finite-sample analysis was conducted with the learning algorithm following a random-step algorithm or else a single-step algorithm, along with 128 examples to internalize its grammar. Initial populations were linguistically homogeneous, and they drifted to different linguistic compositions. The major language groups after 30 generations have been listed in this table. Note that all initially homogeneous populations converge to the same final composition.


2. The final population mix contains all languages in significant proportion. This is in distinct contrast to the previous situations, where we saw that -V2 languages were eliminated over time.

4.2 Modeling Diachronic Trajectories 

With a basic notion of how diachronic systems can evolve given different learning algorithms, we turn next to the question of population trajectories. While we can already see that some evolutionary trajectories have a “linguistically classical” S-shape, their smoothness can vary. However, our formalization allows us to say much more than this. Unlike the previous work in diachronic linguistics that we are familiar with, we can explore the space of possible trajectories, examining factors that affect their evolutionary time course, without assuming an a priori S-shape.

For example, Bailey (1973) proposed a “wave” model of linguistic change: linguistic replacements follow an S-shaped curve over time. In Bailey’s own words (taken

Among the other factors that affect evolutionary tra-

$L_1 \setminus L_2$ are also equally likely, but their total probability is $1 - p$.





Figure 7: The evolution of $L_2$ speakers in the community for various values of p (a parameter related to the sentence distributions $P_i$; see text). The algorithm used was the TLA; the initial population was homogeneous, speaking only $L_1$. The curves for p = 0.05, 0.75, and 0.95 have been plotted as solid lines. (Horizontal axis: generations.)


dynamical system has a specific update procedure according to which the states evolve from some homogeneous initial population. A more complete characterization of the dynamical system would be achieved by obtaining phase-space plots of this system. Such phase-space plots are pictures of the state space filled with trajectories obtained by letting the system evolve from various initial points (states) in the state space.

4.3.1 Phase-Space Plots: Grammatical 
Trajectories 

We have described earlier the relationship between the state of the population in one generation and the next. In our case, let $\Pi$ denote an 8-dimensional vector variable (state variable). Specifically, $\Pi = (\pi_1, \ldots, \pi_8)'$ (with $\sum_{i=1}^{8} \pi_i = 1$), as we discussed before. The following schema reiterates the chain of dependencies involved in the update rule governing system evolution. The state of the population at time $t$ (in generations) allows us to compute the transition matrix $T$ for the Markov chain associated with the memoryless learner. Now, depending upon whether we want (1) an asymptotic analysis or (2) a finite-sample analysis, we compute (1) the limiting behavior of $T^m$ as $m$ (the number of examples) goes to infinity, or (2) the value of $T^n$ (where $n$ is the number of examples after which maturation occurs). This allows us to compute the next state of the population. Thus $\Pi(t+1) = g(\Pi(t))$, where $g$ is a complex nonlinear relation.

$$\Pi(t) \Longrightarrow P \text{ on } \Sigma^* \Longrightarrow T \Longrightarrow T^m \Longrightarrow \Pi(t+1)$$

If we choose a certain initial condition $\Pi_1$, the system will evolve according to the above relation and one can obtain a trajectory of $\Pi$ in the 8-dimensional space over time. Each initial condition yields a unique trajectory, and one can then plot these trajectories, obtaining a phase-space plot. Each such trajectory corresponds to a line in the 8-dimensional plane given by $\sum_{i=1}^{8} \pi_i = 1$. One cannot directly display such a high-dimensional object, but we plot in figure 8 the projection of a particular trajectory onto a two-dimensional subspace given by $(\pi_1(t), \pi_2(t))$ (the proportion of speakers of $L_1$ and $L_2$) at different points in time.

Figure 8: Subspace of a phase-space plot. The plot shows $(\pi_1(t), \pi_2(t))$ as $t$ varies, i.e., the proportion of speakers speaking languages $L_1$ and $L_2$ in the population. The initial state of the population was homogeneous (speaking language $L_1$). The algorithm used was +Greedy -Single value. (Axis label: percentage of speakers, VOS-V2.)

As mentioned earlier, with a different initial condition 
we get a different grammatical trajectory. The complete 
state space picture is thus filled with all the different 
trajectories corresponding to different initial conditions. 
Fig. 9 shows this. 

4.3.2 Stability Issues 

The phase-space plots show that many initial conditions yield trajectories that seem to converge to a single point in the state space. In dynamical systems terminology, this corresponds to a fixed point of the system, i.e., a population mix that stays at the same composition. Many natural questions arise at this stage. What are the conditions for stability? How many fixed points are there in a given system? How can we solve for them? These are interesting questions, but detailed answers are not within the scope of the current paper. In lieu of a more complete analysis we state here a fixed point theorem that allows one to characterize the stable population mixes.

Figure 9: Subspace of a phase-space plot. The plot shows $(\pi_1(t), \pi_2(t))$ as $t$ varies for different nonhomogeneous initial population conditions. The algorithm used was +Greedy -Single value.

First, some notational preliminaries. As before, let $P_i$ be the distribution on the sentences of the $i$th language $L_i$. From $P_i$, we can construct $T_i$, the transition matrix whose elements are given by the explicit procedure documented in Niyogi and Berwick (1993, 1994a, 1994b). The matrix $T_i$ models a +Greedy -Single value learner if the target language is $L_i$ (with sentences from the target produced with $P_i$). Similarly, one can obtain the matrices for other learning variants. Note that fixing the $P_i$'s fixes the $T_i$'s, and so the $P_i$'s are a different sort of “parameter” that characterizes how the dynamical system evolves.$^{16}$ If the state of the parent population at time $t$ is $\Pi(t)$, then it is possible to show that the (true) transition matrix for ±Greedy ±Single value learners is $T = \sum_{i=1}^{8} \pi_i(t) T_i$. For the finite case analysis, the following theorem holds:

Theorem 1 (Finite Case) A fixed point (stable point) of the grammatical dynamical system (obtained by a ±Greedy ±Single value learner operating on the 8-grammar (3-parameter) space with $k$ examples to choose its final hypothesis) is a solution of the following equation:

$$\Pi' = (\pi_1, \ldots, \pi_8) = (1, \ldots, 1)' \left( \sum_{i=1}^{8} \pi_i T_i \right)^{k}$$

Proof (Sketch): This equation is obtained simply by setting $\Pi(t+1) = \Pi(t)$. Note, however, that this is an example of a nonlinear multidimensional iterated function map. The analysis of such dynamical systems is nontrivial, and our theorem by no means captures all the possibilities. ∎

16 There are thus two distinct kinds of parameters in our model: first, parameters that define the $2^n$ languages and define the state space of the system; and second, the $P_i$'s that characterize the way in which the system evolves and are therefore the parameters of the complete grammatical dynamical system.

We can similarly state a theorem for the limiting (asymptotic) case analysis.

Theorem 2 (Limiting or Asymptotic Analysis) A fixed point (stable point) of the grammatical dynamical system (obtained by a ±Greedy ±Single value learner operating on the 8-grammar (3-parameter) space, given infinite examples to choose its mature hypothesis) is a solution of the following equation:

$$\Pi' = (\pi_1, \ldots, \pi_8) = (1, \ldots, 1)' \left( I - \sum_{i=1}^{8} \pi_i T_i + \mathrm{ONE} \right)^{-1}$$

where ONE is the 8x8 matrix with all its entries equal to 1.

Proof: Again this is trivially obtained by setting $\Pi(t+1) = \Pi(t)$. The expression on the right provides an analytical expression for the update equation in the asymptotic case. See Resnick (1992) for details. All the caveats mentioned in the proof of the previous theorem apply here as well. ∎
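As a sanity check on the form of the asymptotic expression, consider a hypothetical two-state chain with a fixed transition matrix $T$ (the numbers are illustrative assumptions, not taken from the 3-parameter system). For a fixed ergodic chain, $(1, \ldots, 1)(I - T + \mathrm{ONE})^{-1}$ gives the stationary distribution:

```latex
\[
T = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix},
\qquad
I - T + \mathrm{ONE} = \begin{pmatrix} 1.1 & 0.9 \\ 0.8 & 1.2 \end{pmatrix},
\qquad
\det(I - T + \mathrm{ONE}) = 1.32 - 0.72 = 0.6
\]
\[
(I - T + \mathrm{ONE})^{-1}
  = \frac{1}{0.6} \begin{pmatrix} 1.2 & -0.9 \\ -0.8 & 1.1 \end{pmatrix},
\qquad
(1, 1)\,(I - T + \mathrm{ONE})^{-1}
  = \frac{1}{0.6}\,(0.4,\; 0.2) = \left(\tfrac{2}{3},\; \tfrac{1}{3}\right)
\]
\[
\left(\tfrac{2}{3},\; \tfrac{1}{3}\right) T
  = \left(0.6 + \tfrac{0.2}{3},\; \tfrac{0.2}{3} + \tfrac{0.8}{3}\right)
  = \left(\tfrac{2}{3},\; \tfrac{1}{3}\right)
\]
```

In the theorem the matrix itself depends on the population state through the weighted sum of the $T_i$, so a fixed point must in addition reproduce the weights used to build that matrix; the example illustrates only the inner linear-algebra step.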

Remark. We have just touched the surface as far as the theoretical characterization of these grammatical dynamical systems is concerned. The main purpose of this paper is to show that these dynamical systems exist as a logical consequence of assumptions about the grammatical space and an acquisition theory. We have exhibited only some preliminary simulations with these systems. From a theoretical perspective, it would be much more valuable to have complete characterizations of such systems. Strogatz (1993) suggests that nonlinear multidimensional mappings with greater than 3 dimensions are likely to be chaotic. It is also interesting to note that iterated function maps define fractal sets. Such investigations are beyond the scope of this paper, and might well be a fruitful area for further research.

5 Example 2: From Old French to 
Modern French; Clark and Roberts 
Analysis Revisited 

So far, our examples have been based on a 3-parameter linguistic theory for which we derived several different dynamical systems. Our goal was to concretely instantiate our philosophical arguments, sketching the factors that influence evolutionary trajectories. In this section, we briefly consider a different parametric linguistic system studied by Clark and Roberts (1993). The historical context in which Clark and Roberts advanced their linguistic proposal is the evolution of Modern French from Old French. Their parameters are intended to capture some, but of course not all, of this change. They too use a learning algorithm (in their case, a genetic algorithm) to account for historical change but do not analyze their model from the dynamical systems viewpoint. Here we adopt their parameterization, with all its strengths and weaknesses, but consider an alternative learning paradigm and the dynamical systems approach.

Extensive simulations in the earlier section reveal that, while the learnability problem of the 3-parameter space can be solved by stochastic hill-climbing algorithms, the long-term evolution under these algorithms has a behavior that is at variance with the diachronic change actually observed in historical linguistics. In particular, we saw how there was a tendency to gain rather than lose the V2 parameter setting. While this could well be an artifact of the class of learning algorithms considered, a more likely explanation is that loss of V2 (observed in many of the world’s languages like French, English, and so forth) is due to an interaction of parameters and triggers other than those considered in the previous section. We investigate this possibility and begin by first reviewing Clark and Roberts’ alternative parametric theory.

5.1 The Parametric Subspace and Data 

We now consider a syntactic space involving 5 (boolean-valued) parameters. We do not attempt to describe these parameters in detail; the interested reader should consult Haegeman (1991) and Clark and Roberts (1993).

1. $p_1$: Case assignment under agreement ($p_1 = 1$) or not ($p_1 = 0$).

2. $p_2$: Case assignment under government ($p_2 = 1$) or not ($p_2 = 0$). Relevant triggers for this parameter include “Adv V S”, “S V O”.

3. $p_3$: Nominative clitics.

4. $p_4$: Null Subject. Here relevant triggers would include “wh V S O”.

5. $p_5$: Verb-second V2. Triggers include “Adv V S” and “S V O”.

These 5 parameters define a 32-grammar space. Each grammar in this parametrized system can be represented by a string of 5 bits depending upon the values of $p_1, \ldots, p_5$; for instance, the first bit position corresponds to case assignment under agreement. We can now look at the surface strings (sentences) generated by each such grammar. For the purpose of explaining how Old French changed to Modern French, Clark and Roberts consider the following key sentences. The parameter settings required to generate each sentence are provided in brackets; an asterisk is a “doesn’t matter” value and an “X” means any phrase.

The Relevant Data

adv V S        [*1**1]
S V O          [*1**1] or [1***0]
wh V S O       [*1***]
wh V s O       [**1**]
X (pro) V O    [*1*11] or [1**10]
X V s          [**1*1]
X s V          [**1*0]
X S V          [1***0]
(s) V Y        [*1*11]

The parameter settings provided in brackets specify the grammars which generate the sentence. For example, the sentence form “adv V S” (corresponding to “quickly ran John”, an incorrect word order in English) is generated by all grammars that have case assignment under government (the second element of the array set to 1, $p_2 = 1$) and verb-second movement ($p_5 = 1$). The other parameters can be set to any value. Clearly there are 8 different grammars that can generate (alternatively, parse) this sentence. Similarly there are 16 grammars that generate the form S V O (8 corresponding to parameter settings of [*1**1] and 8 corresponding to parameter settings of [1***0]) and 4 grammars that generate ((s) V Y).

Remark. Note that the sentence set Clark and Roberts considered is only a subset of the total number of degree-0 sentences generated by the 32 grammars in question. In order to directly compare their model with ours, we have not attempted to expand the data set or fill out the space any further. As a result, not all the grammars have unique extensional properties, i.e., some generate the same set of sentences.

5.2 The Case of Diachronic Syntax Change in French

Continuing with Clark and Roberts’ analysis, within this parameter space, it is historically observed that the language spoken in France underwent a parametric change from the twelfth century to modern times. In particular, they point out that both V2 and pro-drop are lost, illustrated by examples like these:

Loss of null subjects: pro-drop

(1) (Old French; +pro drop)

Si firent (pro) grant joie la nuit
‘thus (they) made great joy the night’

(2) (Modern French; -pro drop)

* Ainsi s’amusaient bien cette nuit
‘thus (they) had fun that night’

Loss of V2

(3) (Old French; +V2)

Lors oirent ils venir un escoiz de tonoire
‘then they heard come a clap of thunder’

(4) (Modern French; -V2)

* Puis entendirent-ils un coup de tonnerre.
‘then they heard a clap of thunder’

Clark and Roberts observe that it has been argued this transition was brought about by the introduction of new word orders during the fifteenth and sixteenth centuries, resulting in generations of children acquiring slightly different grammars and eventually culminating in the grammar of Modern French. A brief reconstruction of the historical process (after Clark and Roberts, 1993) runs as follows.

Old French; setting [11011]. The language spoken in the twelfth and thirteenth centuries had verb-second movement and null subjects, both of which were dropped by the twentieth century. The sentences generated by the parameter settings corresponding to Old French are:


Old French

adv V S        [*1**1]
S V O          [*1**1] or [1***0]
wh V S O       [*1***]
X (pro) V O    [*1*11] or [1**10]


Note that from this data set it appears that both the Case agreement and nominative clitics parameters remain ambiguous. In particular, Old French is in a subset-superset relation with another language (generated by the parameter settings of 11111). In this case, possibly some kind of subset principle (Berwick, 1985) could be used by the learner; otherwise it is not clear how the data would allow the learner to converge to the Old French grammar in the first place. None of the ±Greedy, ±Single value algorithms would converge uniquely to the grammar of Old French.
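The bracketed settings can be checked mechanically by treating each one as a template over 5-bit grammar strings. The sketch below is an illustration, not part of Clark and Roberts' or the authors' machinery: the `TEMPLATES` dictionary is transcribed from the bracketed settings given in this section for each sentence form, and all function names are hypothetical. It enumerates the 32 grammars, counts how many generate each sentence form, and compares the sentence sets of individual grammars.

```python
from itertools import product

# Sentence forms and their bracketed parameter templates, transcribed from
# the data listed in this section ("*" means the parameter value is free).
TEMPLATES = {
    "adv V S": ["*1**1"],
    "S V O": ["*1**1", "1***0"],
    "wh V S O": ["*1***"],
    "wh V s O": ["**1**"],
    "X (pro) V O": ["*1*11", "1**10"],
    "X V s": ["**1*1"],
    "X s V": ["**1*0"],
    "X S V": ["1***0"],
    "(s) V Y": ["*1*11"],
}

# All 32 grammars as 5-character bit strings, e.g. "11011" for Old French.
GRAMMARS = ["".join(bits) for bits in product("01", repeat=5)]


def matches(grammar, template):
    """True if every fixed position of the template agrees with the grammar."""
    return all(t == "*" or t == g for g, t in zip(grammar, template))


def generates(grammar, form):
    """True if the grammar matches at least one template of the sentence form."""
    return any(matches(grammar, t) for t in TEMPLATES[form])


def language(grammar):
    """The set of listed sentence forms that the grammar generates."""
    return {form for form in TEMPLATES if generates(grammar, form)}


def count_generators(form):
    """Number of grammars (out of 32) that generate the sentence form."""
    return sum(generates(g, form) for g in GRAMMARS)
```

Under this transcription, `count_generators` reproduces the counts of 8, 16 and 4 grammars quoted for "adv V S", "S V O" and "(s) V Y", and `language("11011") < language("11111")` holds, which is the subset-superset relation noted above for Old French.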

The string (X)VS occurs with frequency 58% and SV(X) occurs with 34% in Old French texts. It is argued that this frequency of (X)VS is high enough to cause the V2 parameter to trigger to +V2.

Middle French. In Middle French, the data is not consistent with any of the 32 target grammars (equivalent to a heterogeneous population). Analysis of texts from that period reveals that some old forms (like Adv V S) decreased in frequency and new forms (like Adv S V) increased. It is argued in Clark and Roberts that such a frequency shift causes “erosion” of V2, brings about parameter instability, and ultimately leads to convergence to the grammar of Modern French. In this transition period (i.e., when Middle French was spoken/written) the data is of the following form:

adv V S [*1**1]; SVO [*1**1] or [1***0]; wh V S O [*1***]; wh V s O [**1**]; X (pro) V O [*1*11] or [1**10]; X V s [**1*1]; X s V [**1*0]; X S V [1***0]; (s) V Y [*1*11]

Thus, we have old sentence patterns like Adv V S (though it decreases in frequency and becomes only 10%), SVO, X (pro) V O and wh V S O. The new sentence patterns which emerge at this stage are adv S V (increases in frequency to become 60%), X subjclitic V, V subjclitic (pro) V Y (null subjects), wh V subjclitic O.

Modern French [10100]. By the eighteenth century, French had lost both the V2 parameter setting and the null subject parameter setting. The sentence patterns consistent with Modern French parameter settings are SVO [*1**1] or [1***0], X S V [1***0], V s O [**1**]. Note that this data, though consistent with Modern French, will not trigger all the parameter settings. In this sense, Modern French (just like Old French) is not uniquely learnable from data. However, as before, we shall not concern ourselves overly with this, for the relevant parameters (V2 and null subject) are uniquely set by the data. In re.

5.3 Some Dynamical System Simulations 

We can obtain dynamical systems for this parametric space, for a TLA (or TLA-like) algorithm, in a straightforward fashion. We show the results of two simulations conducted with such dynamical systems.

5.3.1 Homogeneous Populations [Initial Old French]

We conducted a simulation on this new parameter space using the Triggering Learning Algorithm. Recall that the relevant Markov chain in this case has 32 states. We start the simulation with a homogeneous population speaking Old French (parameter setting = 11011). Our goal was to see if misconvergence alone could drive Old French to Modern French.

Just as before, we can observe the linguistic composition of the population over several generations. It is observed that in one generation, 15 percent of the children converge to grammar 01011; 18 percent to grammar 01111; 33 percent to grammar 11011 (target); and 26 percent to grammar 11111, with very few having converged to other grammars. Thereafter, the population consists mostly of speakers of these 4 languages, with one important difference: 15 percent of the speakers eventually lose V2. In particular, they have acquired the grammar 11110. Shown in Fig. 10 are the percentages of the population speaking the 4 languages mentioned above as they evolve over 20 generations. Notice that in the space of a few generations, the speakers of 11011 and 01011 have dropped out altogether. Most of the population now speaks language 11111 (46 percent) and 01111 (27 percent). Fifteen percent of the population speaks 11110 and there is a smattering of other speakers. The population remains roughly stable in this configuration thereafter.

Figure 10: Evolution of speakers of different languages in a population starting off with speakers only of Old French. (Horizontal axis: number of generations.)

Observations: 

1. On examining the four languages to which the system converges after one generation, we notice that they share the same settings for the principles [Case assignment under government], [pro drop], and [V2]. These correspond to the three parameters which are uniquely set by data from Old French. The other two parameters can take on any value. Consequently, 4 languages are generated, all of which satisfy the data from Old French.

2. Recall our earlier remark that, due to insufficient data, there were equivalent grammars in the parameter system. It turns out that in this particular case, the grammars (01011) and (11011) are identical as far as their extensional properties are concerned, as are the grammars (11111) and (01111).

3. There is a subset relation between the two sets described in (2). The grammar (11011) is in a subset relation with (11111). This explains why after a few generations most of the population switches to either (11111) or (01111) (the superset grammars).

4. An interesting feature of the simulation is that 15 percent of the population eventually acquires the grammar (11110), i.e., they have lost the V2 parameter setting. This is the first sign of instability of V2 that we have seen in our simulations so far (for greedy algorithms, which are psychologically preferred). Recall that for such algorithms, the V2 parameter was very stable in our previous example.

5.3.2 Heterogenous Populations (Mixtures) 

The earlier section showed that with no new (foreign) sentence patterns the grammatical system starting out with only Old French speakers showed some tendency to lose V2. However, the grammatical trajectory did not terminate in Modern French. In order to more closely duplicate this historically observed trajectory, we examine alternative initial conditions. We start our simulations with an initial condition which is a mixture of two sources: data from Old French and data from Modern French (reproducing, in this sense, data similar to that obtained from the Middle French period). Thus children in the next generation observe new surface forms. Most of the surface forms observed in Middle French are covered by this mixture.

Observations: 

1. On performing the simulations using the TLA as a learning algorithm on this parameter space, an interesting pattern is observed. Suppose the learner is exposed to sentences of which 90 percent are generated by the Old French grammar (11011) and 10 percent by the Modern French grammar (10100). Within one generation, 22 percent of the learners have converged to the grammar (11110) and 78 percent to the grammar (11111). Thus the learners set each of the parameter values to 1 except the V2 parameter setting. Now Modern French is a non-V2 language, and 10 percent of data from Modern French is sufficient to cause 22 percent of the speakers to lose V2. This is the behaviour over one generation. The new population (consisting of 78 percent speaking grammar (11111) and 22 percent speaking grammar (11110)) remains stable forever.

2. Fig. 11 shows the proportion of speakers who have lost V2 after one generation, as a function of the proportion of sentences from the Modern French source. The shape of the curve is interesting. For small values of the proportion of the Modern French source, the slope of the curve is greater than 1. Thus there is a greater tendency of speakers to lose V2 than to retain it. Thus 10 percent of novel sentences from the Modern French source causes 20 percent of the population to lose V2; similarly, 20 percent of novel sentences from the Modern French source causes 40 percent of the speakers to lose V2. This effect wears off later. This seems to capture computationally the intuitive notion of many linguists that a small change in inputs provided to children could drive the system towards larger change.

Figure 11: Tendency to lose V2 as a result of new
word orders introduced by Modern French source in our
Markov Model.

3. Unfortunately, there are several shortcomings of
this particular simulation. First, we notice that mixing
Old and Modern French sources does not cause the desired
(historically observed) grammatical trajectory from
Old to Modern French (corresponding in our system to
movement from state (11011) to state (10100) in our
Markov Chain). We find that a small injection
of sentences from Modern French causes a larger percentage
of the population to lose V2 and gain subject clitics
(which are historically observed phenomena). Nevertheless,
the entire population retains the null subject setting
and case assignment under government. It should
be mentioned that Clark and Roberts argue that the
change in case assignment under government is the driving
force which allows alternate parse-trees to be formed
and causes the parametric loss of V2 and null subject.
In this sense, it is a more fundamental change.

4. If the dynamical system is allowed to evolve, it ends 
up in either of the two states (11111) or (11110). This is 
essentially due to the subset relations these states (lan¬ 
guages) have with other languages in the system. An¬ 
other complication in the system is the equivalence of 
several different grammars (with respect to their surface 
extensions) e.g. given the data we are considering, the 
grammars (01011) and (11011) (Old French) generate 
the same sentences. This leads to multiplicity of paths, 
convergence to more than one target grammar and gen¬ 
eral inelegance of the state-space description. 
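
The role of subset relations in observation 4 can be illustrated with a minimal sketch. It assumes an error-driven, memoryless learner (such as the TLA), which changes its hypothesis only when it hears a sentence it cannot analyze. Under that assumption, any grammar whose language contains every sentence the learner can hear is an absorbing state. The grammar names and sentences below are hypothetical toy data, not the five-parameter system simulated above.

```python
def absorbing_grammars(languages, support):
    """Return the grammars whose language contains every sentence in `support`.

    An error-driven learner leaves a grammar only on an unanalyzable sentence,
    so these grammars can never be left once reached.
    """
    return sorted(name for name, lang in languages.items() if support <= lang)


# Hypothetical toy languages (illustration only, not the paper's data).
toy_languages = {
    "g_a": {"s1", "s2"},
    "g_b": {"s1", "s2", "s3"},
    "g_c": {"s1"},
}
toy_support = {"s1", "s2"}  # every sentence the learner can ever hear

print(absorbing_grammars(toy_languages, toy_support))
```

In this toy case two grammars are absorbing at once, which mirrors the convergence to more than one state noted above.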

Future Directions: There are several possibilities to con¬ 
sider here. 

1. Using more data and filling out the state-space 
might yield greater insight. Note that we can also study
the development of other languages, like Italian or Spanish,
within this framework, and that might be useful.

2. TLA-like hill climbing algorithms do not pay at¬ 
tention to the subset principle explicitly. It would be 
interesting to explicitly program this into the learning 
algorithm and observe the evolution thereafter. 

3. There are often cases when several different grammars
generate the same sentences or at least equally well
fit the data. Algorithms which look only at surface 
strings are unable then to distinguish between them re¬ 
sulting in convergence to all of them with different prob¬ 
abilities in our stochastic setting. We saw an exam¬ 
ple of this for convergence to four states earlier. Clark 
and Roberts suggest an elegance criterion by looking at 
the parse-trees to decide between these grammars. This 
difference between strong generative capacity and weak 
generative capacity can easily be incorporated into the 
Markov model as well. The transition probabilities, now,
will not depend upon the surface properties of the gram¬ 
mars alone, but also upon the elegance of derivation for 
each surface string. 

4. Rather than the evolution of the population, one 
could look at the evolution of the distribution of words. 
One can also obtain bounds on frequencies with which 
the new data in the Middle French Period must occur so 
that the correct drift is observed. 

6 Conclusions and Directions for 
Future Research 

In this paper, we have argued that any combination 
of (grammatical theory, learning paradigm) leads to a 
model of grammatical evolution and diachronic change. 
A learning theory (paradigm) attempts to account for 
how children (the individual child) solve the problem 
of language acquisition. By considering a population of 
such “child learners”, we have arrived at a model of the 
emergent, global, population behavior. The key point 
is that such a model is a logical consequence of grammatical
and learning theories. Consequently, whenever
a linguist suggests a new grammatical or learning theory,
they are also suggesting a particular evolutionary
theory, and the consequences of this need to be examined.

Historical Linguistics and Diachronic Criteria 

From a programmatic perspective, this paper has two
important consequences. First, it allows us to take a 
formal, analytic view of historical linguistics. Most ac¬ 
counts of language change have tended to be descriptive 
in nature (though significant exceptions are the work of 
Lightfoot, Kroch, Clark and Roberts, among others). In 
contrast, we place the study of historical linguistics (di¬ 
achronic phenomena) on a scientific 17 platform. In this 
sense, our conception of historical linguistics is closest 
in spirit to evolutionary theory and population biology 18 
(which attempts to describe the origin and changing pat¬ 
terns of life) and cosmology (which attempts to describe 
the origin and evolution of the physical universe). 

Second, it allows us to formally pose a diachronic 
criterion for the adequacy of grammatical theories. A
significant body of work in learning theory has already
sharpened the learnability criterion for grammatical
theories: in other words, the class of grammars Q
must be learnable by some psychologically plausible al¬ 
gorithm from primary linguistic data. Now we can go 
one step further. The class of grammars Q (along with 
a proposed learning algorithm A) can be reduced to a
dynamical system whose evolution must match that of
the true evolution of human languages (as reconstructed
from historical data).

17 By scientific, we mean the construction of models with
explanatory and predictive power: models which can be
falsified in the sense of Popper.

18 Indeed, most previous attempts to model language
change, like those of Clark and Roberts (1993) and Kroch
(1990), have been influenced by evolutionary models.
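
As a rough illustration of how a pair (Q, A) induces a dynamical system, the following sketch restricts the hypothesis space to just two grammars from the appendix table: the row with Spec = 0, Comp = 1, V2 = 0 (L5) and the row with Spec = 0, Comp = 1, V2 = 1 (called L6 here). Several things are assumptions made for the sketch and are not taken from the simulations of this paper: the two-grammar restriction, uniform sentence distributions within each language, learners that start at either grammar with equal probability, and the maturation time m = 128. The learner is error-driven and memoryless, in the spirit of the TLA.

```python
# Degree-0 sentences transcribed from the appendix table.
L5 = {
    "S V", "S V O", "S V O1 O2", "S AUX V", "S AUX V O", "S AUX V O1 O2",
    "ADV S V", "ADV S V O", "ADV S V O1 O2",
    "ADV S AUX V", "ADV S AUX V O", "ADV S AUX V O1 O2",
}
L6 = {
    "S V", "S V O", "O V S", "S V O1 O2", "O1 V S O2", "O2 V S O1",
    "S AUX V", "S AUX V O", "O AUX S V", "S AUX V O1 O2",
    "O1 AUX S V O2", "O2 AUX S V O1", "ADV V S", "ADV V S O",
    "ADV V S O1 O2", "ADV AUX S V", "ADV AUX S V O", "ADV AUX S V O1 O2",
}


def mixture(alpha):
    """Sentence distribution when a fraction alpha speaks L5 and the rest L6."""
    dist = {}
    for lang, weight in ((L5, alpha), (L6, 1.0 - alpha)):
        for sentence in lang:
            # Assumption: each speaker draws uniformly from their own language.
            dist[sentence] = dist.get(sentence, 0.0) + weight / len(lang)
    return dist


def next_generation(alpha, m):
    """Fraction of learners holding the L5 grammar after hearing m sentences."""
    dist = mixture(alpha)
    # An error-driven learner leaves its grammar only on an unanalyzable sentence.
    leave_l5 = sum(p for s, p in dist.items() if s not in L5)
    leave_l6 = sum(p for s, p in dist.items() if s not in L6)
    p5, p6 = 0.5, 0.5  # assumed initial hypothesis distribution
    for _ in range(m):
        p5, p6 = (p5 * (1 - leave_l5) + p6 * leave_l6,
                  p5 * leave_l5 + p6 * (1 - leave_l6))
    return p5


alpha = 0.9  # assumed initial fraction of L5 speakers
for generation in range(1, 6):
    alpha = next_generation(alpha, m=128)
    print(generation, round(alpha, 4))
```

Each call to `next_generation` is one step of the population-level map; iterating it traces the trajectory of the linguistic composition over generations.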

We have attempted to lay the framework for the devel¬ 
opment of research tools to study historical phenomena. 
To concretely demonstrate that the grammatical dynam¬ 
ical systems need not be impossibly difficult to compute 
(or simulate), we explicitly showed how to transform
parametrized theories and memoryless learning algorithms
into dynamical systems. The specific simulations
of this paper are far too incomplete to have any long-term
linguistic implications, though we hope they form
a starting point for research in this direction.
Nevertheless, certain interesting results were
obtained.

1. We saw that the V2 parameter was more stable
in the 3-parameter case than it was in the 5-parameter
case. This suggests that the loss of V2 (actually observed
in history) might have more to do with the choice of
parametrization than with learning algorithms or primary
linguistic data (though we suggest great caution before
drawing strong conclusions on the basis of this study).

2. We were able to shed some light on the time course 
of evolution. In particular, we saw how this was a deriva¬ 
tive of more fundamental assumptions about initial pop¬ 
ulation conditions, sentence distributions, and learning 
algorithms. 

3. We were able to formally develop notions of sys¬ 
tem stability. Thus, certain parameters could change 
with time, others might remain stable. This can now be 
measured, and the conditions for stability or change can 
be investigated. 

4. We were able to demonstrate how one could tinker 
with the system (by changing the algorithm, or the sen¬ 
tence distributions, or maturational time) to allow evo¬ 
lution in certain directions. This would suggest the kinds 
of changes needed in linguistics for greater explanatory 
adequacy. 
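
The notion of stability in item 3 can be made concrete in the simplest possible setting. The derivation below is a sketch under stated assumptions, not a result quoted from the simulations above. It assumes only two grammars g1 and g2 with languages L1 and L2, an error-driven memoryless learner, and unlimited maturation time, so that the learner's final grammar is distributed according to the stationary distribution of a two-state chain. Here \alpha is the fraction of the population speaking g1, and a = P_1(L1 \cap L2) and b = P_2(L1 \cap L2) are the probabilities that speakers of g1 and g2, respectively, produce a sentence in the intersection. These symbols are introduced for this illustration.

```latex
% Probabilities of leaving each grammar on one input sentence
p_{12} = (1-\alpha)(1-b), \qquad p_{21} = \alpha(1-a)

% Next-generation fraction of g_1 speakers (stationary distribution)
f(\alpha) = \frac{p_{21}}{p_{12} + p_{21}}
          = \frac{\alpha(1-a)}{\alpha(1-a) + (1-\alpha)(1-b)}

% Fixed points and their linear stability
f(0) = 0, \qquad f(1) = 1, \qquad
f'(\alpha) = \frac{(1-a)(1-b)}{\bigl[\alpha(1-a) + (1-\alpha)(1-b)\bigr]^{2}}

f'(0) = \frac{1-a}{1-b}, \qquad f'(1) = \frac{1-b}{1-a}
```

A fixed point is stable when the slope there is below 1 in magnitude. Under these assumptions, the population consisting entirely of g2 speakers (\alpha = 0) is stable exactly when a > b, and the population consisting entirely of g1 speakers (\alpha = 1) is stable exactly when b > a.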

Further Research 

This has been our first attempt to define the bound¬ 
aries of the problem. There are several directions of fur¬ 
ther research. 

1. From a linguistic perspective, the most interesting
thing to do would perhaps be to examine alternative
parametrized theories and to track the change of
certain languages in the context of these theories (much
like our attempt to track the change of French in this
paper). Some worthwhile attempts would include a)
the study of parametric stress systems (Halle and Idsardi,
1992), and in particular the evolution of modern
Greek stress patterns from Proto-Indo-European; b) the
investigation of the possibility that creoles correspond to
fixed points in parametric dynamical systems, a possibility
which might explain the striking fact that all creoles
(irrespective of the linguistic origin, i.e., initial linguistic
composition of the population) have the same grammar;
c) the evolution of modern Urdu, with Hindi syntax and
Persian vocabulary.

2. From a mathematical perspective, one could take
this research in many directions, including a) the formalization
of the update rule for other grammatical theories
and learning algorithms, and the characterization of the
dynamical systems implied therein; b) a closer investigation
of stability issues and a better characterization of
the phase-space plots; c) the investigation of possible
chaotic behaviour: recall that our dynamical systems are
multi-dimensional non-linear iterated function
mappings, a recipe for chaos.

It is our hope that research in this line will mature 
to make useful contributions, both to linguistics, and in 
view of the unusual nature of the dynamical systems in¬ 
volved, to the study of such systems from a mathematical 
perspective. 
A The 3-parameter system of Gibson
and Wexler (1994)

The 3-parameter system discussed in Gibson and Wexler
(1994) includes two parameters from X-bar theory.
Specifically, they relate to specifier-head relations and
head-complement relations in phrase structure. The following
parametrized production rules denote this:

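$$XP \rightarrow \mathrm{Spec}\; X' \;\; (p_1 = 0) \quad \text{or} \quad X'\; \mathrm{Spec} \;\; (p_1 = 1)$$

$$X' \rightarrow \mathrm{Comp}\; X' \;\; (p_2 = 0) \quad \text{or} \quad X'\; \mathrm{Comp} \;\; (p_2 = 1)$$

$$X' \rightarrow X$$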

A third parameter is related to verb movement. In 
German and Dutch root declarative clauses, it is observed
that the verb occupies exactly the second position.
This Verb-Second phenomenon might or might not
be present in the world’s languages, and this variation is 
captured by means of the V2 parameter. 

The following table provides the unembedded (degree-0)
sentences from each of the 8 grammars (languages)
obtained by setting the 3 parameters of example 1 to
different values. The languages are referred to as L1
through L8.
Language           Spec  Comp  V2  Degree-0 unembedded sentences

L1                 1     1     0   "V S" "V O S" "V O1 O2 S" "AUX V S" "AUX V O S"
                                   "AUX V O1 O2 S" "ADV V S" "ADV V O S" "ADV V O1 O2 S"
                                   "ADV AUX V S" "ADV AUX V O S" "ADV AUX V O1 O2 S"

L2                 1     1     1   "S V" "S V O" "O V S" "S V O1 O2"
                                   "O1 V O2 S" "O2 V O1 S" "S AUX V" "S AUX V O"
                                   "O AUX V S" "S AUX V O1 O2" "O1 AUX V O2 S" "O2 AUX V O1 S"
                                   "ADV V S" "ADV V O S" "ADV V O1 O2 S" "ADV AUX V S"
                                   "ADV AUX V O S" "ADV AUX V O1 O2 S"

L3                 1     0     0   "V S" "O V S" "O2 O1 V S" "V AUX S" "O V AUX S"
                                   "O2 O1 V AUX S" "ADV V S" "ADV O V S" "ADV O2 O1 V S"
                                   "ADV V AUX S" "ADV O V AUX S" "ADV O2 O1 V AUX S"

L4                 1     0     1   "S V" "O V S" "S V O" "S V O2 O1" "O1 V O2 S"
                                   "O2 V O1 S" "S AUX V" "S AUX O V" "O AUX V S"
                                   "S AUX O2 O1 V" "O1 AUX O2 V S" "O2 AUX O1 V S" "ADV V S"
                                   "ADV V O S" "ADV V O2 O1 S" "ADV AUX V S"
                                   "ADV AUX O V S" "ADV AUX O2 O1 V S"

L5                 0     1     0   "S V" "S V O" "S V O1 O2" "S AUX V" "S AUX V O"
(English, French)                  "S AUX V O1 O2" "ADV S V" "ADV S V O" "ADV S V O1 O2"
                                   "ADV S AUX V" "ADV S AUX V O" "ADV S AUX V O1 O2"

L6                 0     1     1   "S V" "S V O" "O V S" "S V O1 O2" "O1 V S O2"
                                   "O2 V S O1" "S AUX V" "S AUX V O" "O AUX S V"
                                   "S AUX V O1 O2" "O1 AUX S V O2" "O2 AUX S V O1" "ADV V S"
                                   "ADV V S O" "ADV V S O1 O2" "ADV AUX S V" "ADV AUX S V O"
                                   "ADV AUX S V O1 O2"

L7                 0     0     0   "S V" "S O V" "S O2 O1 V" "S V AUX"
(Bengali, Hindi)                   "S O2 O1 V AUX" "ADV S V" "ADV S O V" "ADV S O2 O1 V"
                                   "ADV S V AUX" "ADV S O V AUX" "ADV S O2 O1 V AUX"

L8                 0     0     1   "S V" "S V O" "O V S" "S V O2 O1" "O1 V S O2"
(German, Dutch)                    "O2 V S O1" "S AUX V" "S AUX O V" "O AUX S V"
                                   "O1 AUX S O2 V" "O2 AUX S O1 V" "ADV V S" "ADV V S O"
                                   "ADV V S O2 O1" "ADV AUX S V" "ADV AUX S O V"
                                   "S AUX O2 O1 V" "S O V AUX" "ADV AUX S O2 O1 V"
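
The way the two X-bar parameters shape the non-V2 rows of the table can be sketched in a few lines. The function below is an illustrative reading of the regularities visible in the table, not Gibson and Wexler's own derivation. It assumes that the head-initial order is (AUX) V followed by the objects, that the head-final order is its mirror image, and that ADV always comes first. The function name is hypothetical, and V2 movement is deliberately not modeled. Under these assumptions it reproduces the rows with (Spec, Comp) = (1, 1), (1, 0) and (0, 1) when V2 = 0.

```python
from itertools import product


def non_v2_sentences(spec, comp):
    """Degree-0 word orders for a non-V2 grammar with the given Spec and Comp values."""
    sentences = []
    object_options = ([], ["O"], ["O1", "O2"])
    for adv, aux, objects in product((False, True), (False, True), object_options):
        # Comp = 1 (head-initial): (AUX) V objects; Comp = 0 is the mirror image.
        x_bar = (["AUX"] if aux else []) + ["V"] + objects
        if comp == 0:
            x_bar = x_bar[::-1]
        # Spec = 0 places the subject before X-bar, Spec = 1 after it.
        clause = ["S"] + x_bar if spec == 0 else x_bar + ["S"]
        sentences.append(" ".join((["ADV"] if adv else []) + clause))
    return sentences


for sentence in non_v2_sentences(spec=0, comp=1):
    print(sentence)
```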